Abstract



The PageRank is a popularity measure designed by Google to rank 
Web pages. Experiments confirm that the PageRank obeys a 'power 
law' with the same exponent as the In-Degree. This paper presents a 
novel mathematical model that explains this phenomenon. The relation 
between the PageRank and In-Degree is modelled through a stochastic 
equation, which is inspired by the original definition of the PageRank, 
and is analogous to the well-known distributional identity for the busy 
period in the M/G/1 queue. Further, we employ the theory of regular
variation and Tauberian theorems to prove analytically that the tail
behavior of the PageRank and the In-Degree differ only by a multiplicative
factor, for which we derive a closed-form expression. Our analytical
results are in good agreement with experimental data.

Keywords: PageRank, In-Degree, Power Law, Regular Variation, Stochas- 

1 Introduction

The notion of PageRank was introduced by Google in order to numerically
characterize the popularity of Web pages. The original description of the
PageRank presented in [9] is as follows:

$$PR(i) = c \sum_{j \to i} \frac{1}{d_j} PR(j) + (1 - c), \qquad (1)$$

where $PR(i)$ is the PageRank of page $i$, $d_j$ is the number of outgoing links
of page $j$, the sum is taken over all pages $j$ that link to page $i$, and $c$ is the
"damping factor", which is some constant between 0 and 1. From this equation
it is clear that the PageRank of a page depends on the number of pages that
link to it and the importance (i.e. PageRanks) of these pages.
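
As an illustration of how the PageRank formula above can be evaluated, the following is a minimal sketch of a fixed-point iteration for the recursion PR(i) = c * sum over pages j linking to i of PR(j)/d_j + (1 - c). The function name `pagerank`, the toy graph and the stopping rule are assumptions made for this example, and the sketch assumes that every page has at least one outgoing link.

```python
def pagerank(out_links, c=0.85, tol=1e-10, max_iter=1000):
    """Iterate PR(i) = c * sum_{j -> i} PR(j) / d_j + (1 - c).

    out_links[j] lists the pages that page j links to. Every page is
    assumed to have at least one outgoing link (no dangling pages).
    """
    n = len(out_links)
    pr = [1.0] * n                      # start from equal PageRanks
    for _ in range(max_iter):
        new = [1.0 - c] * n             # the (1 - c) term for every page
        for j, targets in enumerate(out_links):
            share = c * pr[j] / len(targets)   # c * PR(j) / d_j
            for i in targets:
                new[i] += share
        diff = max(abs(a - b) for a, b in zip(new, pr))
        pr = new
        if diff < tol:
            break
    return pr


# Hypothetical three-page graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
links = [[1, 2], [2], [0]]
ranks = pagerank(links)
print(ranks)
```

With this scaling the PageRanks sum to the number of pages, so the average PageRank is 1.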

In this paper we study the relation between the probability distribution of
the PageRank and the In-Degree of a randomly selected Web page, where the
In-Degree denotes simply the number of incoming hyperlinks of a Web page.
Pandurangan et al. [18] observed that the probability distributions of the PageRank
and the In-Degree for Web data have a similar asymptotic behavior, or, more
precisely, they seem to follow power laws with the same exponent. Loosely
speaking, a 'power law' with exponent $\alpha$ means that the probability that the
random variable takes some large value $x$ is proportional to $x^{-\alpha}$. For the
PageRank and the In-Degree distribution, the exponent $\alpha$ is approximately 2.1.

Recent extensive experiments by Donato et al. [M] and Fortunato et al. [12]
confirmed the similarity in tail behavior observed in [18]. Becchetti and
Castillo [S] extensively investigated the influence of the damping factor $c$ on the
power law behavior of the PageRank. They have shown that the PageRank of
the top 10% of the nodes always follows a power law with the same exponent
independent of the value of the damping factor. Our own experiments based on
Web data from [5T] are also in agreement with [18] (see Figure 1).

Obviously, equation (1) suggests that the PageRank and the In-Degree are
intimately related, but this formula by itself does not explain the observed
similarity in tail behavior. Furthermore, the linear algebra methods that have been
commonly used in the PageRank literature [7], [TS], and proved very successful for
designing efficient computational methods, seem to be insufficient for modelling
and analyzing the asymptotic properties of the PageRank distribution.

The goal of our paper is to provide mathematical evidence for the power-law 
behavior of the PageRank and its relation to the In-Degree distribution. We 
propose a stochastic model that aims to explain this phenomenon. Our approach 
is inspired by the techniques from applied probability and stochastic operations 
research. The relation between the PageRank and the In-Degree is modelled 
through a distributional identity which is analogous to the equation for the 
busy period in the M/G/1 queue (see e.g. [H]). Further, we analyze our model
using the approach employed in [16] for studying the tail behavior of the busy
period in case the service times are regularly varying random variables. This fits
in our research because regular variation is in fact a generalization of the power
law, and it has been widely used in queueing theory to model self-similarity,
long-range dependence and heavy tails [20]. Thus, we use the notion of regular
variation to model the power law distribution of the In-Degree. For the sake of
completeness, in Section 2 we will introduce regularly varying random variables
and describe their basic properties.

To obtain the tail behavior of the PageRank in our model, we use
Laplace-Stieltjes transforms and apply Tauberian theorems presented in the well-known
paper by Bingham and Doney [4], see also Theorem 8.1.6 in [5]. Moreover,
our analysis allows us to derive explicitly the constant multiplicative factor that
quantifies the difference between the PageRank and the In-Degree tail behavior.
Our analytical results show a remarkable agreement with real Web data.

We believe that our approach is extremely promising for analyzing the 
PageRank distribution and solving other problems related to the structural 
properties of the Web. At the end of this paper, we will briefly mention other 
possibilities for probabilistic analysis of the PageRank distribution. In
particular, we provide experimental results for Growing Networks [1], and draw a
parallel between the recent studies [3], [11] on the PageRank behavior in this
class of graph models and our present work.



2 Preliminaries 

This section describes important properties of regularly varying random
variables. We follow definitions and notations by Bingham and Doney [4], Meyer
and Teugels [16] and Zwart [20]. More comprehensive details can be found in
[5].
We say that a function $V(x)$ is regularly varying of index $\alpha \in \mathbb{R}$ if for every
$t > 0$,

$$\frac{V(tx)}{V(x)} \to t^{\alpha} \quad \text{as } x \to \infty.$$

If $\alpha = 0$, then $V$ is called slowly varying. Slowly varying functions are usually
denoted by $L$: for every $t > 0$,

$$\frac{L(tx)}{L(x)} \to 1 \quad \text{as } x \to \infty.$$

Then, a function $V(x)$ is regularly varying of index $\alpha$ if and only if it can be
written in the form

$$V(x) = x^{\alpha} L(x),$$

for some slowly varying $L(x)$.
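
To make the definition concrete, here is a small numerical sketch. The choice V(x) = x^(-1.1) * log(x) is an assumption made only for this illustration (log is a standard example of a slowly varying function), and the helper names are hypothetical. The ratio V(tx)/V(x) approaches t^(-1.1) as x grows, although slowly, which is typical for slowly varying corrections.

```python
import math

ALPHA = -1.1  # index of regular variation used in this illustration


def V(x):
    """Regularly varying function x^ALPHA * L(x) with slowly varying L(x) = log(x)."""
    return x ** ALPHA * math.log(x)


def relative_error(t, x):
    """Relative deviation of V(t*x)/V(x) from its limit t^ALPHA."""
    return abs(V(t * x) / V(x) / t ** ALPHA - 1.0)


for x in (1e3, 1e6, 1e12):
    print(x, relative_error(2.0, x))
```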

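The following lemma provides a useful bound for slowly varying functions.

Lemma 1. (Potter bounds) Let $L$ be a slowly varying function. Then, for
any fixed $A > 1$, $\delta > 0$ there exists a finite constant $K > 1$ such that for all
$x_1, x_2 \ge K$,

$$\frac{L(x_1)}{L(x_2)} \le A \max\left\{ \left(\frac{x_1}{x_2}\right)^{\delta}, \left(\frac{x_1}{x_2}\right)^{-\delta} \right\}.$$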

In probability theory a random variable $X$ is said to be regularly varying
with index (or exponent) $\alpha$ if its distribution function $F_X$ is such that

$$\bar{F}_X(x) := 1 - F_X(x) \sim x^{-\alpha} L(x) \quad \text{as } x \to \infty,$$

for some positive slowly varying function $L(x)$. Here, as in the remainder of this
paper, the notation $a(x) \sim b(x)$ means that $a(x)/b(x) \to 1$.

We denote the Laplace-Stieltjes transform of $X$ by $f$ and the $n$th moment
$\int_0^{\infty} x^n \, dF(x)$, where $F$ is the distribution function of $X$, by $\mu_n$.
The successive moments of $F$ can be obtained by expanding $f$ in a series at
$s = 0$. More precisely, we have the following.

Lemma 2. The $n$th moment of $X$ is finite if and only if there exist numbers
$\mu_0 = 1$ and $\mu_1, \ldots, \mu_n$, such that

$$f(s) = \sum_{i=0}^{n} \frac{\mu_i}{i!} (-s)^i + o(s^n) \quad \text{as } s \to 0.$$

If $\mu_n < \infty$ then we introduce the notation ($n \in \mathbb{N}$)

$$f_n(s) = (-1)^{n+1} \left( f(s) - \sum_{i=0}^{n} \frac{\mu_i}{i!} (-s)^i \right). \qquad (2)$$

Remark 1. It follows from Lemma 2 that the $n$th moment of $X$ is finite if and
only if there exist numbers $\mu_0 = 1$ and $\mu_1, \ldots, \mu_n$ such that $f_n(s) = o(s^n)$ as
$s \to 0$.

The following theorem establishes the relation between asymptotic behavior 
of regularly varying distribution and its Laplace-Stieltjes transform. This result 
will play an essential role in our analysis. 

Theorem 1. (Tauberian Theorem) If $n \in \mathbb{N}$, $\mu_n < \infty$, $\alpha = n + \beta$, $\beta \in (0, 1)$,
then the following are equivalent

(i) $f_n(s) \sim (-1)^n \Gamma(1 - \alpha) s^{\alpha} L(1/s)$ as $s \to 0$,

(ii) $1 - F(x) \sim x^{-\alpha} L(x)$ as $x \to \infty$.

Here and in the remainder of the paper we use the letter $\alpha$ to denote the
index of a complementary distribution function rather than a density. The
power law exponent of the In-Degree in the Web graph then becomes 1.1 rather
than 2.1.



3 The model 

In this section we introduce a model that describes the relation between the 
PageRank and the In-Degree distributions in the form of a stochastic equation. 
This model naturally follows from the definition of the PageRank (1), and is
analytically tractable for obtaining the asymptotic behavior of the PageRank.

3.1 Relation between In-Degree and PageRank 

Our goal now is to describe the relation between the PageRank and the
In-Degree. To this end, we keep equation (1) almost unchanged but we make several
assumptions. First, let $R$ be the PageRank of a randomly chosen page. We
treat $R$ simply as a random variable whose distribution we want to determine.
Second, we assume that the number of outgoing links $d$ is the same for each
page. Then $R$ satisfies a distributional identity

$$R \stackrel{d}{=} c \sum_{j=1}^{M} \frac{1}{d} R_j + (1 - c), \qquad (3)$$

where $M$ is the In-Degree of the considered random page.

We now make the assumption that the Rj's are independent and have the 
same distribution as R itself. We note that the independence assumption is 
obviously not true in general. However, it is also not the case that the PageRank 
values of the pages linking to the same page i are directly related, so we may 
assume independence in this study. 

The novelty of our approach is that we treat the PageRank as a random 
variable which solves a certain stochastic equation. However, this approach is 
quite natural if our goal is to explain the 'power law' behavior of the PageRank 
because the 'power law' is merely a description of a certain class of probability 
distributions. In fact, this point of view is in line with empirical results by
Pandurangan et al. [18] and other authors who consistently present the
(log-log) histogram of the PageRank.

One of the nice features of stochastic equation (3) is that it has the same
form as the original formula (1). Thus, we may hope that our model correctly
describes the relation between the In-Degree and the PageRank. This is easy to
verify in the extreme (unrealistic) case when all pages have the same In-Degree
$d$. In this situation, the PageRanks of all pages are equal, and it is easy to verify
that $R = 1$ constitutes the unique solution of (3).
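
The verification can be written out in one line. The sketch below assumes, as in the extreme case just described, that the In-Degree equals $d$ for every page and that all PageRanks take a common constant value $r$ (the symbol $r$ is introduced here only for this check):

```latex
r = c \sum_{j=1}^{d} \frac{1}{d}\, r + (1 - c) = c\, r + (1 - c)
\quad\Longleftrightarrow\quad (1 - c)\, r = 1 - c
\quad\Longleftrightarrow\quad r = 1,
\qquad \text{since } 0 < c < 1 .
```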

3.2 In-Degree Distribution 

It is well known that the In-Degree of Web pages follows a power law. For our
analysis, however, we need a more formal description of this random variable,
and thus we employ the theory of regular variation. We model the
In-Degree of a randomly chosen page as a nonnegative, integer, regularly varying
random variable, which is distributed as $N(T)$, where $T$ is regularly varying
with index $\alpha$ and $N(t)$ is the number of Poisson arrivals on the time interval
$[0, t]$. Without loss of generality, we assume that the rate of the Poisson process
is equal to 1.

The advantage of this construction is that we do not need to impose any
restrictions on $T$ and at the same time ensure that the In-Degree is integer. We
claim that the random variable $N(T)$ will also be regularly varying with the
same index as $T$, or, more informally, $N(T)$ follows a power law with the same
exponent. Thus, we can think of $N(T)$ as the In-Degree of a random Web page.
For the sake of completeness we present the formal statement and its proof in
the remainder of this section.
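
The construction can be explored numerically. The sketch below draws T from a Pareto distribution (one convenient regularly varying choice, assumed here only for illustration, with tail index 1.5 picked to keep the demo fast), counts rate-1 Poisson arrivals on [0, T], and prints the empirical tails of T and N(T) side by side. The helper names `poisson_count` and `ccdf` are hypothetical.

```python
import random


def poisson_count(t, rng):
    """Number of rate-1 Poisson arrivals on [0, t], found by adding exponential gaps."""
    count = 0
    clock = rng.expovariate(1.0)
    while clock <= t:
        count += 1
        clock += rng.expovariate(1.0)
    return count


def ccdf(samples, x):
    """Empirical probability that a sample exceeds x."""
    return sum(1 for v in samples if v > x) / len(samples)


rng = random.Random(1)
ALPHA = 1.5  # assumed tail index, chosen only for this demo

# T is Pareto with P(T > x) = x^(-ALPHA) for x >= 1; N(T) is the integer In-Degree.
t_samples = [rng.paretovariate(ALPHA) for _ in range(50000)]
n_samples = [poisson_count(t, rng) for t in t_samples]

for x in (5, 10, 20, 40):
    print(x, ccdf(t_samples, x), ccdf(n_samples, x))
```

For moderately large x the two empirical tails should be of the same order, in line with the claim that N(T) inherits the tail index of T.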

Let $F_T$ and $F_{N(T)}$, $f$ and $\phi$ be the distribution functions and the
Laplace-Stieltjes transforms of $T$ and $N(T)$, respectively. Since the random variable $T$
is regularly varying, we have by definition

$$1 - F_T(x) \sim x^{-\alpha} L(x) \quad \text{as } x \to \infty, \qquad (4)$$

where $L(x)$ is some slowly varying function. We claim that for $N(T)$
the following also holds:

$$1 - F_{N(T)}(x) \sim x^{-\alpha} L(x) \quad \text{as } x \to \infty. \qquad (5)$$

To prove this statement we use the Tauberian theorem (Theorem 1). In
order to satisfy the conditions of this theorem, we should first verify whether
the corresponding moments of $T$ and $N(T)$ always exist together. Assuming
that $\mathbb{E}T = d$ we immediately get $\mathbb{E}N(T) = d$. Next, consider the generating
function of $N(T)$,

$$G_{N(T)}(s) = \mathbb{E} s^{N(T)} = \int_0^{\infty} \mathbb{E} s^{N(t)} \, dF_T(t) = \int_0^{\infty} e^{-t(1-s)} \, dF_T(t) = f(1 - s), \qquad (6)$$

from which we derive the Laplace-Stieltjes transform of $N(T)$ in terms of the
Laplace-Stieltjes transform of $T$:

$$\phi(w) = \mathbb{E} e^{-w N(T)} = f(1 - e^{-w}).$$

Now, denote by $\mu_1, \ldots, \mu_n$ and $\xi_1, \ldots, \xi_n$ the first $n$ moments of $T$ and $N(T)$,
respectively, and define $\mu_0 = \xi_0 = 1$. Then we can formulate the next lemma.

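Lemma 3. The following are equivalent

(i) $\mu_n < \infty$,

(ii) $\xi_n < \infty$.

Proof.

(i) $\to$ (ii) By Lemma 2 we know that $\mu_n < \infty$ if and only if $f(t)$ can be written
as

$$f(t) = \sum_{i=0}^{n} \frac{\mu_i}{i!} (-t)^i + o(t^n) \quad \text{as } t \to 0.$$

Denote $t(s) := 1 - e^{-s}$, then $t(s) \to 0$ as $s \to 0$, and we can substitute

$$\phi(s) = f(1 - e^{-s}) = \sum_{i=0}^{n} \frac{\mu_i}{i!} \left( -(1 - e^{-s}) \right)^i + o\left( (1 - e^{-s})^n \right)
= \sum_{i=0}^{n} \frac{\mu_i}{i!} \left( \sum_{k=1}^{\infty} \frac{(-s)^k}{k!} \right)^i + o(s^n),$$

which can be written as

$$\phi(s) = \sum_{i=0}^{n} \frac{\xi_i}{i!} (-s)^i + o(s^n),$$

for some finite constants $\xi_0 = 1$ and $\xi_1, \ldots, \xi_n$ that can be expressed in terms
of $\mu_0 = 1$ and $\mu_1, \ldots, \mu_n$. Thus, by uniqueness of the power series expansion
and by Lemma 2 we have $\xi_n < \infty$.

(ii) $\to$ (i) Similarly, $s(t) := -\ln(1 - t) \to 0$ as $t \to 0$, so we obtain

$$f(t) = \phi(-\ln(1 - t)) = \sum_{i=0}^{n} \frac{\xi_i}{i!} \ln^i(1 - t) + o\left( \ln^n(1 - t) \right)
= \sum_{i=0}^{n} \frac{\xi_i}{i!} \left( -\sum_{k=1}^{\infty} \frac{t^k}{k} \right)^i + o(t^n)
= \sum_{j=0}^{n} \frac{\mu_j}{j!} (-t)^j + o(t^n),$$

for $\mu_0 = 1$ and some $\mu_1, \ldots, \mu_n$ that can be expressed in terms of $\xi_0 = 1$ and
$\xi_1, \ldots, \xi_n$, which similarly implies $\mu_n < \infty$. □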

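Remark 2. If we define

$$f_n(s) = (-1)^{n+1} \left( f(s) - \sum_{i=0}^{n} \frac{\mu_i}{i!} (-s)^i \right) \quad \text{and} \quad
\phi_n(s) = (-1)^{n+1} \left( \phi(s) - \sum_{i=0}^{n} \frac{\xi_i}{i!} (-s)^i \right)$$

as in (2), then we can reformulate Lemma 3 as follows:

$f_n(s) = o(s^n)$ if and only if $\phi_n(s) = o(s^n)$.

Now, we can use Theorem 1 to prove that (4) implies (5). In fact also the
reverse holds, as stated in the next theorem.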

Theorem 2. The following are equivalent

(i) $1 - F_T(x) \sim x^{-\alpha} L(x)$ as $x \to \infty$,

(ii) $1 - F_{N(T)}(x) \sim x^{-\alpha} L(x)$ as $x \to \infty$.

Proof.

(i) $\to$ (ii) From Theorem 1 for $T$ we know that
$1 - F_T(x) \sim x^{-\alpha} L(x)$, $x \to \infty$, implies

$$f_n(t) \sim (-1)^n \Gamma(1 - \alpha) t^{\alpha} L(1/t) \quad \text{as } t \to 0, \qquad (7)$$

where $\alpha > 1$ is not integer and $n$ is the largest integer smaller than $\alpha$.
Since $\phi(s) = f(t)$, by Lemma 3 we have $f_n(t) \sim \phi_n(s)$, where
$t(s) = (1 - e^{-s}) \sim s$, as $s \to 0$. So, we can obtain from (7) by using Lemma 1 that

$$\phi_n(s) \sim (-1)^n \Gamma(1 - \alpha) s^{\alpha} L(1/s) \quad \text{as } s \to 0.$$

Now we again apply Theorem 1 to conclude that

$$1 - F_{N(T)}(x) \sim x^{-\alpha} L(x) \quad \text{as } x \to \infty.$$

(ii) $\to$ (i) Similar to the first part of the proof. □

Thus, our model for the number of incoming links properly describes an
In-Degree distribution that follows a power law with finite expectation and a
non-integer exponent.

3.3 The main stochastic equation 

Combining the ideas from Sections 3.1 and 3.2 we arrive at the following
equation

$$R \stackrel{d}{=} c \sum_{j=1}^{N(T)} \frac{1}{d} R_j + (1 - c), \qquad (8)$$

where $c \in (0, 1)$ is the damping factor, $d \in \{1, 2, \ldots\}$ is the fixed Out-Degree of
each page, and $N(T)$ describes the In-Degree of a randomly chosen page as the
number of Poisson arrivals on a regularly varying time interval $T$. As we
discussed above, stochastic equation (8) adequately captures several important
aspects of the PageRank distribution and its relation to the In-Degree. Moreover,
our model is completely formalized, and thus we can apply analytical methods
in order to derive the tail behavior of the random variable $R$ representing the
PageRank.

Linear stochastic equations like (8) have a long history. In particular, (8) is
similar to the famous equation that arises in the theory of branching processes
and describes many real-life phenomena, for instance, the distribution of the
busy period in the M/G/1 queue:

$$B \stackrel{d}{=} \sum_{i=1}^{N(S_1)} B_i + S_1,$$

where $B$ is the length of the busy period (the time interval during which
the queue is non-empty), $S_1$ is the service time of the customer that initiated
the busy period, $N(S_1)$ is the number of Poisson arrivals during this service
time and the $B_i$'s are independent and distributed as $B$. We refer to [19] and
other books on queueing theory for more details. Also, see Zwart [20] for an
excellent detailed treatment of queues with regular variation, and specifically
the busy period problem. We note also that our equation (8) is a special case
in a rich class of stochastic recursive equations that were discussed in detail in
the recent survey by Aldous and Bandyopadhyay [2].

This concludes the model description. The next step will be to use our 
model for providing a rigorous explanation of the indicated connection between 
the distributions of the In-Degree and the PageRank. 
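
One way to get a feeling for the stochastic equation of Section 3.3, R = c * sum_{j=1..N(T)} R_j / d + (1 - c) in distribution, is to iterate it on a pool of samples. The sketch below is only an illustration and not the analysis used in this paper: the Pareto choice for T, the tail index 1.5, the pool size and the number of rounds are all assumptions, and the function names are hypothetical. In each round every new value of R is built from an independent N(T) and from values drawn at random from the previous pool.

```python
import random


def poisson_count(t, rng):
    """Number of rate-1 Poisson arrivals on [0, t], found by adding exponential gaps."""
    count = 0
    clock = rng.expovariate(1.0)
    while clock <= t:
        count += 1
        clock += rng.expovariate(1.0)
    return count


def sample_pagerank_pool(c=0.85, d=8, alpha=1.5, pool_size=2000, rounds=8, seed=0):
    """Iterate R = c * sum_{j=1..N(T)} R_j / d + (1 - c) on a pool of samples."""
    rng = random.Random(seed)
    scale = d * (alpha - 1.0) / alpha   # rescales a Pareto(alpha) variable so that E[T] = d
    pool = [1.0] * pool_size            # start from R = 1 for every page
    for _ in range(rounds):
        new_pool = []
        for _ in range(pool_size):
            t = scale * rng.paretovariate(alpha)
            in_degree = poisson_count(t, rng)          # N(T)
            total = sum(rng.choice(pool) for _ in range(in_degree))
            new_pool.append(c * total / d + (1.0 - c))
        pool = new_pool
    return pool


pool = sample_pagerank_pool()
print(min(pool), sum(pool) / len(pool), max(pool))
```

Every sampled value is at least 1 - c, and the sample mean should stay in the vicinity of 1, although heavy tails make it fluctuate from run to run.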

4 Analysis 

The idea of our analysis is to write the equation for the Laplace-Stieltjes
transforms of $T$ and $R$ and then make use of the Tauberian theorems to prove that
$R$ is regularly varying with the same index as $T$. According to Theorem 2 this
will give us the desired similarity in tail behavior of the PageRank $R$ and the
In-Degree $N(T)$.

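As a result of the assumptions from Section 3 we can express the
Laplace-Stieltjes transform $r(s)$ of the PageRank distribution $R$ in terms of the
probability generating function of $N(T)$ using (8):

$$
\begin{aligned}
r(s) := \mathbb{E} e^{-sR}
&= e^{-s(1-c)} \, \mathbb{E} \exp\left( -s \frac{c}{d} \sum_{j=1}^{N(T)} R_j \right) \\
&= e^{-s(1-c)} \sum_{k=0}^{\infty} \mathbb{E} \exp\left( -s \frac{c}{d} \sum_{j=1}^{k} R_j \right) P(N(T) = k) \\
&= e^{-s(1-c)} \sum_{k=0}^{\infty} \left( \mathbb{E} \exp\left( -s \frac{c}{d} R \right) \right)^k P(N(T) = k) \\
&= e^{-s(1-c)} \sum_{k=0}^{\infty} \left( r\left( \frac{c}{d} s \right) \right)^k P(N(T) = k)
 = e^{-s(1-c)} \, G_{N(T)}\left( r\left( \frac{c}{d} s \right) \right).
\end{aligned}
$$

Since, by (6), $G_{N(T)}(s) = f(1 - s)$, we arrive at

$$r(s) = f\left( 1 - r\left( \frac{c}{d} s \right) \right) e^{-s(1-c)}. \qquad (9)$$

It can be shown (e.g. arguing as in [10, Section XIII.4]) that equation (9) has
a unique solution $r(s)$ which is completely monotone and has $r(0) = 1$ if and
only if $c/d < 1$. This inequality is satisfied for the typical values $d \ge 1$ and
$0 < c < 1$.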

As in Section 3.2 we will start the analysis by providing the
correspondence between existence of the $n$-th moments of $T$ and $R$. We recall that
$\mu_1, \ldots, \mu_n$ denote the first $n$ moments of $T$. Further, denote the first $n$
moments of $R$ by $\eta_1, \ldots, \eta_n$, and define

$$r_n(s) = (-1)^{n+1} \left( r(s) - \sum_{k=0}^{n} \frac{\eta_k}{k!} (-s)^k \right), \qquad \eta_0 = 1,$$

as in (2). Note that taking expectations on both sides of (8) we easily obtain
$\mathbb{E}R = \eta_1 = 1$. This follows from the independence of $N(T)$ and the $R_j$'s and
the fact that $\mathbb{E}N(T) = \mathbb{E}T = \mu_1 = d$.
The next lemma holds. 

Lemma 4. The following are equivalent

(i) $\mu_n < \infty$,

(ii) $\eta_n < \infty$.

Proof.

(i) $\to$ (ii) We use induction, starting from $n = 1$ for which both (i) and (ii) are
valid. Assume that for $k = 1, 2, \ldots, n - 1$ it has been shown that (i) $\to$ (ii). We
introduce the following notation, to be used throughout this section. Denote

$$g(s) := e^{-s(1-c)} \quad \text{and} \quad t(s) := 1 - r\left( \frac{c}{d} s \right).$$

Then we can write (9) as

$$r(s) = f(t) g(s). \qquad (10)$$

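We know from (i) that

$$f(t) = 1 - d t + \sum_{k=2}^{n} \frac{\mu_k}{k!} (-t)^k + o(t^n).$$

Thus, from (10) we obtain

$$r(s) - d g(s) \, r\left( \frac{c}{d} s \right) = \left( 1 - d + \sum_{k=2}^{n} \frac{\mu_k}{k!} (-t)^k + o(t^n) \right) g(s). \qquad (11)$$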
However, it follows from the induction hypothesis for $n - 1$ that

$$r(s) = 1 - s + \sum_{k=2}^{n-1} \frac{\eta_k}{k!} (-s)^k + o(s^{n-1}),$$

so we can present $t(s)$ as a sum

$$t(s) = -\sum_{k=1}^{n-1} \frac{\eta_k}{k!} \left( -\frac{c}{d} s \right)^k + o(s^{n-1}).$$

Using this, we can actually find $t^k(s)$:

$$t^k(s) = \sum_{i=k}^{n+k-2} \beta_{k,i} s^i + o(s^{n+k-2}), \qquad (12)$$

for $k \ge 1$ and appropriate constants $\beta_{k,i}$, $i = k, \ldots, k + n - 2$. Thus, we obtain
by (11) and (12)

$$r(s) - d g(s) \, r\left( \frac{c}{d} s \right) = \left( \sum_{i=0}^{n} \gamma_i s^i + o(s^n) \right) g(s)$$

for appropriate constants $\gamma_0, \ldots, \gamma_n$. Using the expansion of $g(s)$, it is not
difficult to show that for appropriate constants $\rho_0, \ldots, \rho_n$, we also have

$$r(s) - d \, r\left( \frac{c}{d} s \right) = \sum_{i=0}^{n} \rho_i s^i + o(s^n).$$



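In other words, because of the uniqueness of the series expansion, we have

$$\left( r(s) - d \, r\left( \frac{c}{d} s \right) \right)_n = r_n(s) - d \, r_n\left( \frac{c}{d} s \right) = o(s^n). \qquad (13)$$

We will now show that this implies (ii), to which end we consider the partial
sums

$$\sum_{k=0}^{N} d^k \left( r_n\left( \left( \frac{c}{d} \right)^{k} s \right) - d \, r_n\left( \left( \frac{c}{d} \right)^{k+1} s \right) \right)
= r_n(s) - d^{N+1} r_n\left( \left( \frac{c}{d} \right)^{N+1} s \right).$$

Taking the limit as $N \to \infty$, we have for the last term that

$$\lim_{N \to \infty} d^{N+1} r_n\left( \left( \frac{c}{d} \right)^{N+1} s \right)
= \lim_{N \to \infty} d^{N+1} \, o\left( \left( \frac{c}{d} \right)^{(N+1)(n-1)} \right)
= \lim_{N \to \infty} o\left( c^{(N+1)(n-1)} \left( \frac{1}{d} \right)^{(N+1)(n-2)} \right) = 0,$$

where we used the induction hypothesis $r_n(s) = o(s^{n-1})$ together with $n \ge 2$,
$0 < c < 1$ and $d \ge 1$. It follows that we can express $r_n(s)$ as an infinite sum,

$$r_n(s) = \sum_{k=0}^{\infty} d^k \left( r_n\left( \left( \frac{c}{d} \right)^{k} s \right) - d \, r_n\left( \left( \frac{c}{d} \right)^{k+1} s \right) \right), \qquad (14)$$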

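where we can apply (13) to each of the terms. Further, by definition of $o(s^n)$, for
all $\varepsilon > 0$, there exists a $\delta = \delta(\varepsilon)$ such that
$\left| r_n(s) - d \, r_n\left( \frac{c}{d} s \right) \right| < \varepsilon s^n$ whenever
$0 < s < \delta$. Moreover, for this $\varepsilon$ and $\delta$, and $0 < s < \delta$, we also have

$$|r_n(s)| \le \sum_{k=0}^{\infty} d^k \left| r_n\left( \left( \frac{c}{d} \right)^{k} s \right) - d \, r_n\left( \left( \frac{c}{d} \right)^{k+1} s \right) \right|
< \sum_{k=0}^{\infty} d^k \varepsilon \left( \left( \frac{c}{d} \right)^{k} s \right)^n
= \varepsilon s^n \sum_{k=0}^{\infty} \left( \frac{c^n}{d^{n-1}} \right)^k. \qquad (15)$$

Here the second inequality holds because $0 < \left( \frac{c}{d} \right)^k s < \delta$ for every $k \ge 0$. Since
the geometric series in (15) converges (because $c^n / d^{n-1} < 1$),
for every $\varepsilon_0 > 0$ there exists $\delta_0$ such that

$$\left| r_n(s) - d \, r_n\left( \frac{c}{d} s \right) \right| < \varepsilon_0 \left( 1 - \frac{c^n}{d^{n-1}} \right) s^n$$

for $0 < s < \delta_0$. Then, according to (15), we have $|r_n(s)| < \varepsilon_0 s^n$ whenever
$0 < s < \delta_0$, by which we have shown that $r_n(s) = o(s^n)$.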

(ii) $\to$ (i) Assume that there exists a nonnegative random variable $R$
satisfying (8). Then, obviously, $R \ge 1 - c$. Moreover, (8) also implies that $R$ is
stochastically greater than $(1 - c) \left( \frac{c}{d} N(T) + 1 \right)$. Hence, the existence of the
$n$-th moment of $R$ ensures the existence of the $n$-th moment of $N(T)$, which in
turn by Lemma 3 ensures the existence of the $n$-th moment of $T$. □

Remark 3. Note that the stochastic inequality
$R \stackrel{d}{\ge} (1 - c) \left( \frac{c}{d} N(T) + 1 \right)$ implies
that the tail of the PageRank is at least as heavy as the tail of the In-Degree.

Remark 4. Similarly to Remark 2, we can reformulate Lemma 4 as

$f_n(s) = o(s^n)$ if and only if $r_n(s) = o(s^n)$.

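From the first part of the proof of Lemma 4 we also obtain the next corollary.

Corollary 1. The following holds:

$$r_n(s) - d \, r_n\left( \frac{c}{d} s \right) = f_n(t) + O(s^{n+1}).$$

Proof. By definitions of $r_n(s)$, $f_n(t)$, $t(s)$ and Lemma 4 it follows from (10)
that for fixed $n$,

$$\sum_{k=0}^{n} \frac{\eta_k}{k!} (-s)^k + (-1)^{n+1} r_n(s)
= \left( 1 - d t + \sum_{k=2}^{n} \frac{\mu_k}{k!} (-t)^k + (-1)^{n+1} f_n(t) \right) g(s).$$

Because $r_n(s) = o(s^n)$ we can extend (12) for $k \ge 1$ and appropriate
constants $\beta_{k,i}$, $i = k, \ldots, k + n - 1$:

$$t^k(s) = \sum_{i=k}^{n+k-1} \beta_{k,i} s^i + o(s^{n+k-1}),$$

and rewrite the last equation as

$$(-1)^{n+1} r_n(s) + \sum_{k=0}^{n} \frac{\eta_k}{k!} (-s)^k
= (-1)^{n+1} \left( f_n(t) + d \, r_n\left( \frac{c}{d} s \right) \right) + \sum_{k=0}^{n+1} \tau_k s^k + o(s^{n+1}),$$

where $\tau_0, \ldots, \tau_{n+1}$ are corresponding constants. Now due to the uniqueness of
the series expansion, we can reduce the above formula to

$$r_n(s) = f_n(t) + d \, r_n\left( \frac{c}{d} s \right) + (-1)^{n+1} \tau_{n+1} s^{n+1} + o(s^{n+1}).$$

Then we get:

$$r_n(s) - d \, r_n\left( \frac{c}{d} s \right) = f_n(t) + O(s^{n+1}).$$

□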

Now we are ready to explain the similarity between the In-Degree and the 
PageRank distributions. The next theorem formalizes this main statement. 

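Theorem 3. The following are equivalent

(i) $1 - F_{N(T)}(x) \sim x^{-\alpha} L(x)$ as $x \to \infty$,

(ii) $1 - F_R(x) \sim \dfrac{c^{\alpha}}{d^{\alpha} - c^{\alpha} d} \, x^{-\alpha} L(x)$ as $x \to \infty$.

Proof.

(i) $\to$ (ii) From (i) and Theorem 2 it follows that

$$1 - F_T(x) \sim x^{-\alpha} L(x) \quad \text{as } x \to \infty. \qquad (16)$$

Theorem 1 also implies that (16) is equivalent to
$f_n(t) \sim (-1)^n \Gamma(1 - \alpha) t^{\alpha} L(1/t)$,
where $t(s) \sim (c/d) s$, as $s \to 0$. Then, by Corollary 1 we obtain

$$r_n(s) - d \, r_n\left( \frac{c}{d} s \right) \sim (-1)^n \Gamma(1 - \alpha) \left( \frac{c}{d} \right)^{\alpha} s^{\alpha} L(1/s) \quad \text{as } s \to 0.$$

Then also for every $k \ge 0$, as $s \to 0$, we have

$$d^k \left( r_n\left( \left( \frac{c}{d} \right)^{k} s \right) - d \, r_n\left( \left( \frac{c}{d} \right)^{k+1} s \right) \right)
\sim (-1)^n \Gamma(1 - \alpha) \left( \frac{c}{d} \right)^{\alpha} \left( \frac{c^{\alpha}}{d^{\alpha - 1}} \right)^{k} s^{\alpha} L(1/s),$$

and from the infinite-sum representation (14) for $r_n(s)$, we directly obtain

$$r_n(s) \sim (-1)^n \Gamma(1 - \alpha) \frac{c^{\alpha}}{d^{\alpha} - c^{\alpha} d} \, s^{\alpha} L(1/s) \quad \text{as } s \to 0.$$

Now we again apply Theorem 1 which leads to (ii).

(ii) $\to$ (i) The proof follows easily from (ii) and Corollary 1. □

Thus, we have shown that the asymptotic behavior of the PageRank and the
In-Degree differ only by the multiplicative factor $c^{\alpha} / (d^{\alpha} - c^{\alpha} d)$, whereas the
power law exponent remains the same. In the next section we will experimentally verify
this result.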
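
For a quick numerical impression of this factor, the sketch below evaluates it for a few damping factors. It reads the factor in Theorem 3 as c^alpha / (d^alpha - c^alpha * d); the function name `tail_factor` is hypothetical, and the parameter values d = 8.2 and alpha = 1.1 are the ones quoted for the Web data in Section 5.2.

```python
import math


def tail_factor(c, d, alpha):
    """Multiplicative factor c^alpha / (d^alpha - c^alpha * d) relating the two tails."""
    return c ** alpha / (d ** alpha - c ** alpha * d)


# Base-10 logarithm of the factor for three damping factors.
for c in (0.1, 0.5, 0.9):
    print(c, math.log10(tail_factor(c, d=8.2, alpha=1.1)))
```

The factor is increasing in c, so a larger damping factor moves the PageRank tail closer to the In-Degree tail on a log-log plot.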

5 Numerical Results 
5.1 Power Law Identification 

The identification and measuring of power law behavior is not always simple. 
In this section we provide a brief overview of techniques that we used to plot 
and numerically identify power law distributions. 

The standard strategy is to plot a histogram of a quantity on logarithmic
scales to obtain a straight line, which is a typical feature of the power law.
However, this technique is often not efficient. In [17], Newman clearly illustrated
that even for generated random numbers with a known distribution the noise
in the tail region has a strong influence on the estimation of the power law
parameters. He suggests plotting the fraction of measurements that are not
smaller than a given value, i.e. the complementary cumulative distribution
function $\bar{F}(x) = P(X \ge x)$, rather than the histogram. The advantage is that we
obtain a less noisy plot. Besides, this idea is consistent with our analysis in the
previous section, which was based on complementary cumulative distribution
functions. We note that if the distribution of $X$ follows a power law with
exponent $\alpha$ so that $\bar{F}(x) \sim C x^{-\alpha}$, $x \to \infty$, where $C$ is some constant, then
the corresponding histogram has an exponent $\alpha + 1$. Thus, the plot of $\bar{F}(x)$ on
logarithmic scales has a smaller slope than the plot of the histogram.
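
The plotted quantity can be computed with a few lines of code. The following sketch returns, for each distinct observed value x, the fraction of measurements that are not smaller than x; the function name `empirical_ccdf` and the toy data are assumptions made for this example.

```python
def empirical_ccdf(values):
    """Return the distinct sorted values x and the fraction of samples >= x."""
    data = sorted(values)
    n = len(data)
    xs, fractions = [], []
    for idx, x in enumerate(data):
        # The first occurrence of x sits at position idx, so n - idx samples are >= x.
        if idx == 0 or x != data[idx - 1]:
            xs.append(x)
            fractions.append((n - idx) / n)
    return xs, fractions


xs, fractions = empirical_ccdf([1, 2, 2, 5])
print(xs, fractions)
```

Plotting `fractions` against `xs` on logarithmic axes gives the kind of plot described above.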

Computing the correct slope from the observed data is also not trivial.
Goldstein et al. in [13], and later Newman in [17], have proposed to use maximum
likelihood estimation, which provides a more robust estimation of the power
law exponent than the standard least-squares fit method. Thus, we compute
the exponent $\alpha$ using the next formula from [17]:

$$\alpha = 1 + N \left( \sum_{i=1}^{N} \ln \frac{x_i}{x_{\min}} \right)^{-1}. \qquad (17)$$

Here the quantities $x_i$, $i = 1, \ldots, N$, are the measured values of $X$, and $x_{\min}$
usually corresponds to the smallest value of $X$ for which the power law behavior
is assumed to hold.
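
A minimal sketch of this estimator is given below. The function name `mle_exponent` and the synthetic Pareto test data are assumptions made for the example. Note that the formula, as stated in [17] for a density proportional to x^(-alpha), returns the exponent of the density (histogram); under the convention of this paper the corresponding index of the complementary distribution function is smaller by one.

```python
import math
import random


def mle_exponent(samples, x_min):
    """Estimate 1 + N / sum(ln(x_i / x_min)) over the samples with x_i >= x_min."""
    tail = [x for x in samples if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    return 1.0 + len(tail) / log_sum


# Synthetic data with P(X > x) = x^(-1.1) for x >= 1, i.e. a density exponent of 2.1.
rng = random.Random(0)
data = [rng.paretovariate(1.1) for _ in range(100000)]
estimate = mle_exponent(data, x_min=1.0)
print(estimate, estimate - 1.0)   # density exponent and tail index
```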

In the next sections we will present our experiments on real Web Data and on
a graph that represents a well-known mathematical model of the Web (Growing
Networks). In both cases, for each value $x$, we plot in log-log scale the number
of measurements that are not smaller than $x$, and we use (17) to obtain the
exponents.

5.2 Web Data

To confirm our results on asymptotic similarity between PageRank and
In-Degree distributions we performed experiments on the public Stanford Web
data. We calculated the PageRanks over a Web graph with 281903
nodes (pages) and 2.3 million edges (links) using the standard power method
(see e.g. [T5]).

There are several papers, see [H], [T3] and [TB], that describe similar
experiments for different domains and different numbers of pages, and they all
confirm that the PageRank and the In-Degree follow power laws with the same
exponent, around 2.1. In Figure 1 we show the log-log plots for the In-Degree
and the PageRank of the Stanford Web Data, for different values of the damping
factor ($c = 0.1$, $0.5$ and $0.9$). Clearly, these empirical values of In-Degree and
PageRank constitute parallel straight lines for all values of the damping factor,
provided that the PageRank values are reasonably large. It was observed in [B]
that in general, the PageRank depends on the damping factor but the PageRank
of the top 10% of pages obeys a power law with the same exponent as the
In-Degree, independent of the damping factor. This is in perfect agreement with
our experimental results and the mathematical model, which is focused on the
right tail behavior of the PageRank distribution.

The calculations based on the maximum likelihood method yield a slope
$-1.1$ for each of the lines, which verifies that the In-Degree and PageRank
have power laws with the same exponent $\alpha = 1.1$ (which corresponds to the
well known value 2.1 for the histogram). More precisely, we fitted the lines
$y = -1.1x + 5.52$, $y = -1.1x + 4.57$, $y = -1.1x + 4.17$, and $y = -1.1x + 3.37$
for the plots of the In-Degree and PageRanks with $c = 0.9$, $c = 0.5$ and $c = 0.1$,
respectively. We also investigated whether Theorem 3 correctly predicts the
multiplicative factor

$$y(c) = \frac{c^{\alpha}}{d^{\alpha} - c^{\alpha} d}.$$

In Figure 2 we plotted $\log_{10}(y(c))$ and we compared it to the observed differences
between the logarithms of the complementary cumulative distribution functions
of the PageRank and the In-Degree, for different values of the damping factor.
Here $d = 8.2$ as in the Web data. We see that theoretical and observed values
are remarkably close. Thus, our model not only allows us to prove the similarity in
the power law behavior but also gives a good approximation for the difference
between the two distributions.

The discrepancy between the predicted and observed values of the
multiplicative factor suggests that our model does not capture the PageRank behavior to
the full extent. For instance, the assumption of the independence of the
PageRank of pages that have a common neighbor may be too strong. We believe
however that the achieved precision, especially for small values of $c$, is quite
good for our relatively simple stochastic model.

Figure 1: Plots for the Web data. Number of pages with In-Degree/PageRank
greater than x versus x in log-log scale, and the fitted straight lines. 

5.3 Growing Networks 

Growing Networks, introduced by Barabasi and Albert [1], now represent a
large class of models that are commonly accepted as a possible scenario of Web
growth. In particular, these models provide a mathematical explanation for the
power law behavior of the In-Degree. The recent studies [3], [11] addressed
for the first time the PageRank distribution in Growing Networks.

Growing Network models are characterized by preferential attachment. This
entails that a newly created node connects to the existing nodes with
probabilities that are proportional to the current In-Degrees of the existing nodes. We
simulated a slightly modified version of this model, where a new link points to
a randomly chosen page with probability $\beta$, and with probability $1 - \beta$ the
preferential attachment selection rule is used. This allows us to tune the exponent
of the resulting power law [17].

We simulate our Growing Network using Matlab. We start with $d$ nodes and
at each step we add a new node that links to $d$ already existing nodes. To ensure
the same number of outgoing links for all pages, at the end of the simulation,
we link the first $d$ nodes to randomly chosen pages. In the example presented
below we set $\beta = 0.2$ and obtain a network of 50000 nodes with Out-Degree
$d = 8$.
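
The simulation procedure can be sketched as follows. This is a Python illustration written for this text, not the Matlab code used for the experiments. Several details are assumptions: repeated links to the same target and self-links of the first d nodes are allowed, a uniform choice is used while no page has an incoming link yet, and the In-Degrees used for preferential attachment are those seen before the new node arrives. The function names are hypothetical, and a smaller network than in the experiment is generated to keep the example fast.

```python
import random


def growing_network(n_nodes, d=8, beta=0.2, seed=0):
    """Grow a network where every node ends up with exactly d outgoing links."""
    rng = random.Random(seed)
    out_links = [[] for _ in range(n_nodes)]
    link_ends = []   # one entry per existing link, holding its target page
    for new in range(d, n_nodes):
        for _ in range(d):
            if not link_ends or rng.random() < beta:
                target = rng.randrange(new)        # uniformly chosen existing page
            else:
                target = rng.choice(link_ends)     # proportional to current In-Degree
            out_links[new].append(target)
        link_ends.extend(out_links[new])
    for first in range(d):
        # At the end, link the first d nodes to randomly chosen pages.
        out_links[first] = [rng.randrange(n_nodes) for _ in range(d)]
    return out_links


def in_degrees(out_links):
    """Count the incoming links of every page."""
    degrees = [0] * len(out_links)
    for targets in out_links:
        for t in targets:
            degrees[t] += 1
    return degrees


network = growing_network(2000)
degrees = in_degrees(network)
print(max(degrees), sum(degrees) / len(degrees))
```

Choosing uniformly from `link_ends` selects a page with probability proportional to its current In-Degree, which is a common way to implement preferential attachment without recomputing probabilities.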

In Figure 3 we present the numerical data for the In-Degree and the
PageRank in the Growing Network. Clearly, the Web data from Section 5.2 shows a
much better agreement with our model than the data generated by the
preferential attachment algorithm. In the next section we briefly compare recent results
on the PageRank in Growing Networks to our present study and we indicate
possible directions for further research.

Figure 2: The theoretical and observed differences between logarithmic
asymptotics of the In-Degree and the PageRank.



6 Discussion 



Our model and analysis resulted in the conclusion that the PageRank and the
In-Degree should follow power laws with the same exponent. Growing Network
models may provide an alternative explanation [3], [11]. For instance, in the
recent paper by Avrachenkov and Lebedev [3] it was shown that the expected
PageRank in Growing Networks follows a power law with an exponent which
does depend on the damping factor but is approximately 2.08 for $c = 0.85$. Thus, the
model in [3] can also be used to explain the tail behavior of the PageRank, but
it leads to a slightly different result than our model because in our case the
power law exponent of the PageRank does not depend on the damping factor.
The reason could be that we focus only on the asymptotics, whereas [3] employs
a mean-field approximation. Indeed, experiments show that the shape of the
PageRank distribution does depend on the damping factor, and thus, it may
affect the average values, whereas the tail behavior remains the same for all
values of $c$.

Figure 3: Plots for the Growing Network model. Number of pages with
In-Degree/PageRank greater than x versus x in log-log scale.

We emphasise that compared to [3], [11], our model provides a completely
different approach for modelling the relation between the In-Degree and the
PageRank. Specifically, we do not make any assumption on the underlying
Web graph, whereas [3], [11] opt for the preferential attachment structure,
thus exploiting the fact that this graph model correctly captures the In-Degree
distribution. We believe that both approaches should be elaborated and used
in further research on the PageRank distribution.

One of the important innovations in the present work is the analogy between
the PageRank equation and the equation for the busy period that enables us to
apply the techniques from [16]. In fact, queueing systems with heavy tails and in
particular the busy period problem allow for a more sophisticated probabilistic
analysis (see e.g. [20]). It would be interesting to apply these advanced methods
to the problems related to the World Wide Web and PageRank.

Our model definitely lacks the dependencies between the PageRanks of the 
pages sharing a common neighbor. Such dependencies must be present in the 
Web in particular due to the high clustering of the Web graph TT (roughly 
speaking, clustering means that with high probability, two neighbors of the 
same page are connected to each other). Thus, in our further research we could 
try to include some sort of dependencies in our stochastic equation. Another 
natural way to bring our model closer to the real-life situation is to allow random 
(heavy-tailed) Out-Degrees. It would be interesting to investigate in which ways 
these new features will affect the PageRank asymptotics. 